Rate-distortion theory
Rate-distortion theory (Shannon, 1959) characterizes the fundamental tradeoff between compression and fidelity. Given a source and a limited-capacity channel, it asks: what is the minimum number of bits (rate $R$) needed to represent the source such that the expected distortion stays below a threshold $D$?
The rate-distortion function is:

$$R(D) = \min_{p(\hat{x} \mid x):\ \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$$
Lower distortion (better fidelity) requires higher rate (more bits). The Lagrangian form minimizes:

$$\mathbb{E}[d(X, \hat{X})] + \beta\, I(X; \hat{X})$$
The two terms measure different things. $I(X; \hat{X})$ (mutual information) is the rate — how much knowing the compressed representation $\hat{X}$ tells you about the original $X$. If the encoder throws everything away ($\hat{X}$ independent of $X$), $I(X; \hat{X}) = 0$: zero bits, maximum distortion. In the lossless discrete limit, preserving everything gives $I(X; \hat{X}) = H(X)$: you need as many bits as the source has entropy. The mutual information tells you how many distinct codewords the encoding scheme effectively needs — more mutual information means finer distinctions preserved, more bits to transmit. It’s an aggregate property of the whole encoding scheme, not of any single sample.
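The two extremes can be checked numerically. The sketch below (function name and toy distributions are illustrative, not from the source) computes $I(X; \hat{X})$ in bits for a constant encoder and for a lossless identity encoder:

```python
import numpy as np

def mutual_information(p_x, p_xhat_given_x):
    """I(X; X_hat) in bits for a discrete source and stochastic encoder.

    p_x: shape (n,) source distribution.
    p_xhat_given_x: shape (n, m) encoder; each row sums to 1.
    """
    p_joint = p_x[:, None] * p_xhat_given_x   # p(x, xhat)
    p_xhat = p_joint.sum(axis=0)              # output marginal
    denom = p_x[:, None] * p_xhat[None, :]    # p(x) * p(xhat)
    mask = p_joint > 0                        # skip zero-probability cells
    return float((p_joint[mask] * np.log2(p_joint[mask] / denom[mask])).sum())

p_x = np.array([0.5, 0.25, 0.25])   # H(X) = 1.5 bits

# Encoder that ignores X: every input maps to the same symbol -> zero rate.
constant = np.array([[1.0, 0.0, 0.0]] * 3)
print(mutual_information(p_x, constant))   # 0.0

# Lossless identity encoder -> I(X; X_hat) = H(X) = 1.5 bits.
identity = np.eye(3)
print(mutual_information(p_x, identity))   # 1.5
```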
In contrast, $d(x, \hat{x})$ is a pointwise distortion function that scores a single reconstruction $\hat{x}$ against the original $x$ (e.g., squared error $d(x, \hat{x}) = (x - \hat{x})^2$).
Lower distortion requires higher rate (more bits to preserve finer distinctions), but this is a property of optimal encodings — a bad encoder can waste bits and still have high distortion. $R(D)$ traces the Pareto frontier. The Lagrangian objective minimizes both, with $\beta$ controlling the tradeoff.
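Points on this frontier can be computed with the Blahut–Arimoto algorithm. The sketch below is a standard textbook construction (variable names are mine, assuming numpy): for a binary symmetric source with Hamming distortion, the analytic answer is $R(D) = 1 - H_b(D)$, which the iteration should reproduce.

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=500):
    """One (rate, distortion) point via Blahut-Arimoto.

    p_x: (n,) source distribution; dist: (n, m) distortion matrix;
    s: tradeoff parameter (larger s -> lower distortion, higher rate).
    """
    n, m = dist.shape
    q = np.full(m, 1.0 / m)                   # output marginal q(xhat)
    for _ in range(n_iter):
        w = q[None, :] * np.exp(-s * dist)    # unnormalized p(xhat|x)
        p_cond = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_cond                      # re-estimate the marginal
    joint = p_x[:, None] * p_cond
    mask = joint > 0
    rate = (joint[mask] * np.log2((p_cond / q[None, :])[mask])).sum()
    distortion = (joint * dist).sum()
    return float(rate), float(distortion)

# Binary symmetric source, Hamming distortion: R(D) = 1 - H_b(D).
p_x = np.array([0.5, 0.5])
dist = 1.0 - np.eye(2)
R, D = blahut_arimoto(p_x, dist, s=2.0)
Hb = -D * np.log2(D) - (1 - D) * np.log2(1 - D)
print(R, D, 1 - Hb)   # R matches 1 - H_b(D)
```

Sweeping `s` traces the whole $R(D)$ curve, one point per value of the tradeoff parameter.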
Related
Information bottleneck
The Information bottleneck method can be viewed as an RD-style objective over representations $Z$: minimize $I(X; Z)$ while maximizing the predictive information $I(Z; Y)$. The canonical objective is $\min_{p(z \mid x)} I(X; Z) - \beta\, I(Z; Y)$, and one common RD interpretation uses a KL-based distortion term (e.g., between $p(y \mid x)$ and $p(y \mid z)$) rather than reconstruction error.
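For discrete variables the IB Lagrangian can be evaluated directly. A minimal sketch (toy joint distribution and encoder names are hypothetical), using the Markov structure $Y - X - Z$ so that $p(z, y) = \sum_x p(x, y)\, q(z \mid x)$:

```python
import numpy as np

def mi(p_joint):
    """Mutual information in bits from a joint distribution table."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    denom = p_a * p_b
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / denom[mask])).sum())

def ib_objective(p_xy, q_z_given_x, beta):
    """IB Lagrangian I(X; Z) - beta * I(Z; Y) for a discrete encoder."""
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * q_z_given_x   # joint p(x, z)
    p_zy = q_z_given_x.T @ p_xy         # joint p(z, y) via the Markov chain
    return mi(p_xz) - beta * mi(p_zy)

# Toy joint p(x, y): X is a noisy indicator of Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

copy_x   = np.eye(2)                    # Z = X: all predictive info, full rate
constant = np.array([[1.0, 0.0]] * 2)   # Z ignores X: zero rate, zero prediction
print(ib_objective(p_xy, copy_x, 2.0), ib_objective(p_xy, constant, 2.0))
```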
Link to VAE
In deep learning, the $\beta$-Variational autoencoder embodies the rate-distortion tradeoff via the ELBO (Evidence Lower Bound):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The terms map onto the rate-distortion Lagrangian (with expectation over the data distribution):
- $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ is the negative distortion: how well can you reconstruct $x$ from the latent $z$?
- $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ is the rate: how much the encoder’s posterior deviates from the prior on average. If the encoder ignores the input and outputs the prior, KL = 0 (zero rate). If it encodes fine-grained distinctions, the KL is large. The expected KL, $\mathbb{E}_{p(x)}[D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))]$, is an upper bound on $I(X; Z)$.
So the ELBO has the same structure as $-(\text{distortion} + \beta \cdot \text{rate})$: maximize fidelity minus $\beta$ times rate. Increasing $\beta$ tightens the bottleneck — typically more compression and often better disentanglement, at the cost of worse reconstruction.
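The loss itself is short to write down. A minimal numpy sketch (names and toy data are illustrative), using squared-error distortion and the closed-form KL between a diagonal-Gaussian posterior $\mathcal{N}(\mu, \sigma^2)$ and a standard-normal prior:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta):
    """Negative ELBO with a beta-weighted rate term (mean over the batch).

    Distortion: squared-error reconstruction (unit-variance Gaussian likelihood).
    Rate: closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.
    """
    distortion = 0.5 * ((x - x_recon) ** 2).sum(axis=1)
    rate = 0.5 * (mu ** 2 + np.exp(logvar) - logvar - 1.0).sum(axis=1)
    return (distortion + beta * rate).mean()

# Hypothetical encoder outputs: batch of 4, data dim 3, latent dim 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
x_recon = x + 0.1 * rng.normal(size=(4, 3))
mu = rng.normal(scale=0.5, size=(4, 2))
logvar = np.full((4, 2), -1.0)

# A posterior that matches the prior exactly contributes zero rate.
zero_rate = beta_vae_loss(x, x, np.zeros((4, 2)), np.zeros((4, 2)), beta=4.0)
print(zero_rate)   # 0.0
print(beta_vae_loss(x, x_recon, mu, logvar, beta=4.0))
```

Raising `beta` increases the penalty on any posterior that deviates from the prior, which is exactly the tightened bottleneck described above.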
See also
- DAmato2025number: applies RDT to explain the emergence of number sense in $\beta$-VAEs